Building An MT Dictionary From Parallel Texts Based On Linguistic And Statistical Information
Abstract
A method for generating a machine translation (MT) dictionary from parallel texts is described. This method utilizes both statistical information and linguistic information to obtain corresponding words or phrases in parallel texts. By combining these two types of information, translation pairs which cannot be obtained by a linguistic-based method can be extracted. Over 70% accurate translations of compound nouns and over 50% of unknown words are obtained as the first candidate from small Japanese/English parallel texts containing severe distortions.

1 INTRODUCTION

Parallel texts (corpora) are useful resources for acquiring a variety of linguistic knowledge (Dangan, 1991; Matsumoto, 1993), especially for machine translation systems which inherently require customizations. Translation dictionaries are, needless to say, the most basic and powerful knowledge source for improving and customizing translation systems. Our research interest lies in automatic generation of translation dictionaries from parallel texts. In this perspective, finding corresponding words or phrases in bilingual texts will be the fundamental factor for accurate translation.

Statistics-based processing has proven to be very powerful for aligning sentences and words in parallel corpora (Brown, 1991; Gale, 1993; Chen, 1993). Kupiec proposes an algorithm for finding noun phrases in bilingual corpora (Kupiec, 1993). In this algorithm, noun-phrase candidates are extracted from tagged and aligned parallel texts using a noun phrase recognizer and the correspondences of these noun phrases are calculated based on the EM algorithm. Accuracy of around 90% has been attained for the hundred highest ranking correspondences. Statistics-based processing is effective when a relatively large amount of parallel texts is available, i.e. when high frequencies are obtained.

On the other hand, existing linguistic knowledge can be used for finding corresponding words or phrases in parallel texts.
For example, possible target expressions for a source expression provided by a translation system (linguistic knowledge source) can be a key in searching the corresponding expressions in a corpus (Nogami, 1991; Katoh, 1993). Yamamoto (1993) proposes a method for generating a translation dictionary from Japanese/English parallel texts. In this method, English and Japanese compound noun phrases are extracted from parallel texts and their correspondences are searched by matching their possible translations generated by the existing translation dictionary. However, acquirable noun phrases are limited by the linguistic generative power of the translation dictionary. Furthermore, this method utilizes no sentence alignment information, which can reduce errors in finding noun phrase correspondences.

This paper proposes a new method for generating an MT dictionary from parallel texts. It utilizes both statistical and linguistic information to obtain corresponding words or phrases in parallel texts. By combining these two types of information, translation pairs which cannot be obtained by the above linguistic-based method can be extracted, and a highly accurate translation dictionary is generated from relatively small parallel texts.

2 APPROACH TO BUILDING AN MT DICTIONARY

Our goal in building an MT dictionary from parallel texts is to develop a robust method which enables highly accurate extraction of translation pairs from a relatively small amount of parallel texts as well as from parallel texts containing severe distortions. In real-world applications, generally it is extremely difficult especially for MT users to obtain a large amount of high quality parallel texts of one specific domain. If source and target languages do not belong to the same linguistic family, like Japanese and English, the situation becomes grave.
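The idea of combining the two information sources can be made concrete with a small sketch. The text above does not give the scoring formulas, so everything in the following is an assumption made for illustration: the Dice coefficient as the statistical measure, dictionary word overlap as the linguistic measure, the equal weighting, the function names, and the romanized, pre-segmented toy terms. It is not the method of this paper, of Kupiec (1993), or of Yamamoto (1993).

```python
from collections import Counter
from itertools import product


def dice_scores(aligned_pairs):
    """Statistical evidence (assumed measure): Dice coefficient of
    co-occurrence for each (source term, target term) pair."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_terms, tgt_terms in aligned_pairs:
        src_set, tgt_set = set(src_terms), set(tgt_terms)
        src_freq.update(src_set)
        tgt_freq.update(tgt_set)
        pair_freq.update(product(src_set, tgt_set))
    return {
        (s, t): 2.0 * n / (src_freq[s] + tgt_freq[t])
        for (s, t), n in pair_freq.items()
    }


def linguistic_score(src_term, tgt_term, dictionary):
    """Linguistic evidence (assumed measure): fraction of source words
    with a dictionary translation found in the target term."""
    src_words = src_term.split()
    tgt_words = set(tgt_term.lower().split())
    hits = sum(1 for w in src_words if tgt_words & set(dictionary.get(w, ())))
    return hits / len(src_words) if src_words else 0.0


def rank_candidates(aligned_pairs, dictionary, weight=0.5):
    """Rank target candidates per source term by a weighted sum of the
    two scores (the weighting scheme is hypothetical)."""
    ranked = {}
    for (s, t), d in dice_scores(aligned_pairs).items():
        score = weight * d + (1.0 - weight) * linguistic_score(s, t, dictionary)
        ranked.setdefault(s, []).append((score, t))
    return {
        s: [t for _, t in sorted(cands, key=lambda c: (-c[0], c[1]))]
        for s, cands in ranked.items()
    }


# Toy data: each item is one aligned unit holding its extracted terms.
aligned = [
    (["handoutai sochi", "kioku sochi"], ["semiconductor device", "memory device"]),
    (["handoutai sochi"], ["semiconductor device"]),
    (["kioku sochi"], ["memory device"]),
]
# "kioku" is deliberately missing, standing in for an unknown word.
toy_dictionary = {"handoutai": ["semiconductor"], "sochi": ["device"]}

ranking = rank_candidates(aligned, toy_dictionary)
```

In this toy setting the dictionary alone cannot separate the two candidates for "kioku sochi", because only "sochi" is known. The co-occurrence score breaks the tie in favour of "memory device", which is the kind of pair a purely linguistic method would miss.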
As one typical example of MT dictionary compilation, we have selected Japanese and English patent documents which contain many state-of-the-art technical terms. Although these documents are not cul-
Publication date: 1994